housing_df = read.csv("housing.csv")[,-1]
## first column is Id so it can be removed
1. Take a summary of the data and explore the result. How many categorical and numerical variables are there in the dataset?
summary(housing_df)
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig
Min. : 20.0 Length:1460 Min. : 21.00 Min. : 1300 Length:1460 Length:1460 Length:1460 Length:1460 Length:1460 Length:1460
1st Qu.: 20.0 Class :character 1st Qu.: 59.00 1st Qu.: 7554 Class :character Class :character Class :character Class :character Class :character Class :character
Median : 50.0 Mode :character Median : 69.00 Median : 9478 Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
Mean : 56.9 Mean : 70.05 Mean : 10517
3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602
Max. :190.0 Max. :313.00 Max. :215245
NA's :259
LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd
Length:1460 Length:1460 Length:1460 Length:1460 Length:1460 Length:1460 Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
Class :character Class :character Class :character Class :character Class :character Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967
Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Median : 6.000 Median :5.000 Median :1973 Median :1994
Mean : 6.099 Mean :5.575 Mean :1971 Mean :1985
3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual
Length:1460 Length:1460 Length:1460 Length:1460 Length:1460 Min. : 0.0 Length:1460 Length:1460 Length:1460 Length:1460
Class :character Class :character Class :character Class :character Class :character 1st Qu.: 0.0 Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character Median : 0.0 Mode :character Mode :character Mode :character Mode :character
Mean : 103.7
3rd Qu.: 166.0
Max. :1600.0
NA's :8
BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
Length:1460 Length:1460 Length:1460 Min. : 0.0 Length:1460 Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460 Length:1460
Class :character Class :character Class :character 1st Qu.: 0.0 Class :character 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character Class :character
Mode :character Mode :character Mode :character Median : 383.5 Mode :character Median : 0.00 Median : 477.5 Median : 991.5 Mode :character Mode :character
Mean : 443.6 Mean : 46.55 Mean : 567.2 Mean :1057.4
3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
Max. :5644.0 Max. :1474.00 Max. :2336.0 Max. :6110.0
CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
Length:1460 Length:1460 Min. : 334 Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
Class :character Class :character 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
Mode :character Mode :character Median :1087 Median : 0 Median : 0.000 Median :1464 Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
Mean :1163 Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :4692 Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea
Min. :0.000 Length:1460 Min. : 2.000 Length:1460 Min. :0.000 Length:1460 Length:1460 Min. :1900 Length:1460 Min. :0.000 Min. : 0.0
1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961 Class :character 1st Qu.:1.000 1st Qu.: 334.5
Median :1.000 Mode :character Median : 6.000 Mode :character Median :1.000 Mode :character Mode :character Median :1980 Mode :character Median :2.000 Median : 480.0
Mean :1.047 Mean : 6.518 Mean :0.613 Mean :1979 Mean :1.767 Mean : 473.0
3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.000 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0
Max. :3.000 Max. :14.000 Max. :3.000 Max. :2010 Max. :4.000 Max. :1418.0
NA's :81
GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
Length:1460 Length:1460 Length:1460 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000 Length:1460
Class :character Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 Class :character
Mode :character Mode :character Mode :character Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000 Mode :character
Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
Length:1460 Length:1460 Min. : 0.00 Min. : 1.000 Min. :2006 Length:1460 Length:1460 Min. : 34900
Class :character Class :character 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character 1st Qu.:129975
Mode :character Mode :character Median : 0.00 Median : 6.000 Median :2008 Mode :character Mode :character Median :163000
Mean : 43.49 Mean : 6.322 Mean :2008 Mean :180921
3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 3rd Qu.:214000
Max. :15500.00 Max. :12.000 Max. :2010 Max. :755000
length(housing_df[,sapply(housing_df, is.character) == TRUE]) ## Number of categorical variables
[1] 43
length(housing_df[,sapply(housing_df, is.numeric) == TRUE]) ## Number of numerical variables
[1] 37
2. (1pt) Which columns have missing values and what percentage of those columns have NAs? (Note. You can use colMeans(is.na(your data frame)) to find the percentage of NAs in each column).
missing_vals_columns = which(colSums(is.na(housing_df)) > 0)
sort(colSums(sapply(housing_df[missing_vals_columns], is.na)), decreasing = TRUE)
PoolQC MiscFeature Alley Fence FireplaceQu LotFrontage GarageType GarageYrBlt GarageFinish GarageQual GarageCond BsmtExposure BsmtFinType2 BsmtQual
1453 1406 1369 1179 690 259 81 81 81 81 81 38 38 37
BsmtCond BsmtFinType1 MasVnrType MasVnrArea Electrical
37 37 8 8 1
The above columns have missing values
means = colMeans(sapply(housing_df[missing_vals_columns], is.na))
newmeans = as.numeric(sprintf("%2.3f", means)) * 100
names(newmeans) = names(means)
sort(newmeans, decreasing = TRUE)
PoolQC MiscFeature Alley Fence FireplaceQu LotFrontage GarageType GarageYrBlt GarageFinish GarageQual GarageCond BsmtExposure BsmtFinType2 BsmtQual
99.5 96.3 93.8 80.8 47.3 17.7 5.5 5.5 5.5 5.5 5.5 2.6 2.6 2.5
BsmtCond BsmtFinType1 MasVnrType MasVnrArea Electrical
2.5 2.5 0.5 0.5 0.1
The above are the % NAs in each column with NAs
3. Is there any obvious outlier in the SalePrice? If so, remove them
## Before transformation
str(housing_df$SalePrice)
int [1:1460] 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
first_quant = quantile(housing_df$SalePrice)[2]
third_quant = quantile(housing_df$SalePrice)[4]
iqr = third_quant - first_quant
iqr_index = (housing_df$SalePrice > first_quant - 1.5*iqr & housing_df$SalePrice < third_quant + 1.5*iqr)
housing_df = housing_df[iqr_index, ]
## After transformation
str(housing_df$SalePrice)
int [1:1399] 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
4. (2pt)Read the data description carefully. For some of the variables, such as PoolQC, FirePlaceQU, Fence, etc. NA means not applicable rather than missing at random. For instance, a house that does not have a pool gets NA for PoolQC. For those variables for which NA means not applicable, you can replace NA with zero ( if that variable is numeric) or replace it with a new category/level, for instance, “notApplicable” if that variable is categorical.
# Cleaning categorical variables
cat_columns = c("Alley","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PoolQC","Fence","MiscFeature")
new_columns = lapply(cat_columns, function(x){
housing_df[is.na(housing_df[,x]),x] <<- "notApplicable"
})
# Cleaning numerical variables
num_columns = (sapply(housing_df, is.numeric) == TRUE) & (colSums(is.na(housing_df)) > 0)
new_columns = lapply(names(housing_df[,num_columns]), function(x){
housing_df[is.na(housing_df[,x]),x] <<- 0
})
5. (1pt) After replacing not applicable NAs with appropriate values, find out which columns still have NAs and what percentage of each column is missing.
missing_vals_columns = which(colSums(is.na(housing_df)) > 0)
means = colMeans(sapply(housing_df[missing_vals_columns], is.na))
newmeans = as.numeric(sprintf("%2.3f", means)) * 100
names(newmeans) = names(means)
sort(newmeans, decreasing = TRUE)
MasVnrType Electrical
0.5 0.1
6. (1pt) what percentage of rows in the dataset have one or more missing values? Use “complete.cases” function to answer this question.
missing_count = length(which(!complete.cases(housing_df)))
as.numeric(sprintf("%2.3f",missing_count/1399)) * 100 # 1399 is the number of observations in the data frame
[1] 0.6
0.6% of rows have missing values
8. (1pt) plot the histogram of SalePrice. Interpret the histogram. Is SalePrice variable skewed? To replace SalePrice with log(SalePrice. Compare the histogram of salesprice before and after log transformation.
hist(housing_df$SalePrice, xlab = "SalePrice", main = "Sale Price")
Yes, the SalePrice variable is right skewed
hist(log(housing_df$SalePrice), xlab = "Log SalePrice", main = "Sale Price")
The histogram of Log SalePrice is left skewed.
9. (2 pt) Use plot (SalePrice~. , data=housing) (replace housing with your dataframe after data cleaning) to draw scatter and side by side box plots of other variables against the Sale Price. From these plots, what variables seem to have correlation with SalePrice? (Note since we have so many variables, you do not need to use statistics tests, you can just answer this question based on your observations of the plots)
attach(housing_df)
The plots for categorical variables
colName = names(housing_df)
correlated_variables = vector(mode = "character")
temp = lapply(colName, function(col) {
if (is.character(housing_df[,col])) {
chisq_temp = chisq.test(SalePrice, as.factor(housing_df[,col]))
if (chisq_temp$p.value < 0.006) {
correlated_variables <<- append(correlated_variables, col)
}
plot(SalePrice ~ as.factor(housing_df[,col]), xlab = col)
}
})
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
non_scatter = c("MSSubClass","OverallQual","OverallCond","BsmtFullBath","BsmtFullBath","FullBath","HalfBath","BedroomAbvGr","KitchenAbvGr","TotRmsAbvGrd","Fireplaces","GarageCars","MoSold","YrSold")
temp1 = lapply(non_scatter, function(col) {
chisq_temp = chisq.test(SalePrice, as.factor(housing_df[,col]))
if (chisq_temp$p.value < 0.006) {
correlated_variables <<- append(correlated_variables, col)
}
plot(SalePrice ~ as.factor(housing_df[,col]), xlab = col)
})
Chi-squared approximation may be incorrectChi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
The scatter plots for numeric variables
col_name = names(housing_df)
scatter_cols = setdiff(col_name, non_scatter)
temp = lapply(scatter_cols, function(col) {
if (is.numeric(housing_df[,col]) & col != "SalePrice") {
cor_temp = cor.test(SalePrice, housing_df[,col])
if (cor_temp$estimate >= 0.5 | cor_temp$estimate <= -0.5) {
correlated_variables <<- append(correlated_variables, col)
}
plot(SalePrice ~ housing_df[,col], xlab = col)
}
return()
})
NA
Below are the variables I’ve interpreted to be correlated with SalePrice
correlated_variables
[1] "MSZoning" "Street" "LotShape" "Neighborhood" "Condition2" "MasVnrType" "ExterQual" "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
[12] "BsmtExposure" "Heating" "CentralAir" "Electrical" "KitchenQual" "FireplaceQu" "GarageFinish" "SaleType" "SaleCondition" "OverallQual" "OverallCond"
[23] "FullBath" "HalfBath" "TotRmsAbvGrd" "GarageCars" "YearBuilt" "YearRemodAdd" "TotalBsmtSF" "X1stFlrSF" "GrLivArea" "GarageArea"
10. (2 pt) Examine the columns with missing values to see if any of them are categorical. Use caret’s createDataPartition method to partition the dataset to 80% training and 20% testing. If a categorical column has missing values in train or test data, impute it with the mode of that column in the training data. It is important that the mode is computed based only on the training data only (instead of the entire dataset) to avoid data leakage.
These are the categorical columns with missing values
which(colSums(is.na(housing_df)) > 0 & sapply(housing_df, is.character))
MasVnrType Electrical
25 42
library(caret)
# creating data partition
test_indexes = createDataPartition(y = housing_df$SalePrice, p = 0.8)
test_data = housing_df[test_indexes$Resample1, ]
train_data = housing_df[-test_indexes$Resample1, ]
# define getmode
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Impute test data
masVnrType_na = is.na(test_data$MasVnrType)
electrical_na = is.na(test_data$Electrical)
test_data$MasVnrType[masVnrType_na] = getmode(test_data$MasVnrType)
test_data$Electrical[electrical_na] = getmode(test_data$Electrical)